Example-Based DB-Outlier Detection from High Dimensional Datasets
Abstract
Outlier detection is an important problem with applications in many fields, and high dimensional datasets are common in such applications. Among existing outlier detection methods, Distance-Based outlier (DB-Outlier) detection is one of the simplest and most generalizable approaches: it finds outliers by calculating distances between data points. However, in high dimensional space the data distribution is sparse, so every data point becomes a good outlier candidate. It has been shown that meaningful outliers are likely to be identified by examining the behavior of the data in low dimensional projections. Separately, Example-Based outlier detection is a promising way to discover the user's hidden view of outliers. In this paper, we present a new method for detecting DB-Outliers in high dimensional datasets based on user examples. The method finds the subspace in which the user examples stand out more significantly than in any other subspace, and reports the outliers detected in that subspace.
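The idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not say how "stand out" is measured or how subspaces are searched, so the scoring function `example_score`, the brute-force search in `best_subspace`, and the toy data are all assumptions made for illustration. The outlier test assumes the common DB(p, D) formulation, in which a point is an outlier when at least a fraction p of the other points lie farther than D from it.

```python
from itertools import combinations
from math import dist


def project(points, dims):
    """Keep only the coordinates listed in `dims`."""
    return [[pt[i] for i in dims] for pt in points]


def db_outliers(points, dims, p, d):
    """Indices of DB(p, d)-outliers in the subspace `dims` (assumed definition):
    a point is flagged when at least a fraction `p` of the other points
    lie farther than `d` from it."""
    proj = project(points, dims)
    n = len(proj)
    flagged = []
    for i, a in enumerate(proj):
        far = sum(1 for j, b in enumerate(proj) if j != i and dist(a, b) > d)
        if far >= p * (n - 1):
            flagged.append(i)
    return flagged


def mean_dist_to_others(proj, i):
    """Average distance from point i to every other point."""
    return sum(dist(proj[i], b) for j, b in enumerate(proj) if j != i) / (len(proj) - 1)


def example_score(points, dims, example_ids):
    """Hypothetical 'outstandingness' score: how much farther the user
    examples are from the rest, relative to an average point."""
    proj = project(points, dims)
    overall = sum(mean_dist_to_others(proj, i) for i in range(len(proj))) / len(proj)
    examples = sum(mean_dist_to_others(proj, i) for i in example_ids) / len(example_ids)
    return examples / overall if overall > 0 else 0.0


def best_subspace(points, example_ids, max_dims=2):
    """Brute-force search over all subspaces with at most `max_dims` dimensions."""
    n_dims = len(points[0])
    candidates = (c for k in range(1, max_dims + 1)
                  for c in combinations(range(n_dims), k))
    return max(candidates, key=lambda dims: example_score(points, dims, example_ids))


# Toy data: only the third coordinate separates points 5 and 6 from the rest.
points = [
    (0.0, 5.0, 1.0),
    (1.0, 1.0, 1.1),
    (2.0, 9.0, 0.9),
    (1.0, 3.0, 1.0),
    (0.0, 7.0, 1.2),
    (2.0, 2.0, 8.0),  # index 5: the user-supplied example outlier
    (1.0, 6.0, 9.0),  # index 6: not given as an example
]

dims = best_subspace(points, example_ids=[5])
print(dims, db_outliers(points, dims, p=0.7, d=5.0))  # (2,) [5, 6]
```

On this toy data the single example (index 5) selects the third coordinate, and the DB-Outlier test in that subspace also reports index 6, which the user never marked. Exhaustive enumeration is only practical for a handful of dimensions; a real high dimensional dataset would need a smarter search strategy.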
Similar Resources
A Robust Method for Detecting DB-Outliers from High Dimensional Datasets
Outlier detection is a popular technique that can be used in many modern applications such as financial analysis and fraud detection. As data descriptions become more complex, the dimensionality of the datasets involved keeps increasing. However, current research finds that it is extremely difficult to pick out outliers directly from high dimensional datasets owing to the curse of dimensionality. ...
A Fast Randomized Method for Local Density-Based Outlier Detection in High Dimensional Data
Local density-based outlier (LOF) detection is a useful method for detecting outliers because it is model-free and locally based. However, the method is very slow for high dimensional datasets. In this paper, we introduce a randomization method that can compute LOF very efficiently for high dimensional datasets. Based on a consistency property of outliers, random points are selected to partition a...
Random Subspace Learning Approach to High-Dimensional Outliers Detection
We introduce and develop a novel approach to outlier detection based on adaptation of random subspace learning. Our proposed method handles both high-dimension low-sample size and traditional low-dimensional high-sample size datasets. Essentially, we avoid the computational bottleneck of techniques like Minimum Covariance Determinant (MCD) by computing the needed determinants and associated mea...
Detecting outliers in high-dimensional neuroimaging datasets with robust covariance estimators
Medical imaging datasets often contain deviant observations, the so-called outliers, due to acquisition or preprocessing artifacts or resulting from large intrinsic inter-subject variability. These can undermine the statistical procedures used in group studies as the latter assume that the cohorts are composed of homogeneous samples with anatomical or functional features clustered around a cent...
Distance-based outliers: algorithms and applications
This paper deals with finding outliers (exceptions) in large, multidimensional datasets. The identification of outliers can lead to the discovery of truly unexpected knowledge in areas such as electronic commerce, credit card fraud, and even the analysis of performance statistics of professional athletes. Existing methods that we have seen for finding outliers can only deal efficiently with two dime...